LLM Fine-Tuning for AKGC

Fine-tuning Mistral 7B with QLoRA for Automatic Knowledge Graph Construction from unstructured text.

This project was developed for the Natural Language Processing course in the Master's Degree in Data Science.

Building Knowledge Graphs from unstructured text has traditionally required complex pipelines involving multiple specialized models, posing a significant barrier for many developers and organizations. My goal was to create a more accessible alternative by fine-tuning a Large Language Model (LLM) for this specific task.
I proposed a novel approach to Automatic Knowledge Graph Construction (AKGC) using Mistral 7B, a cutting-edge open-source LLM. The key innovation was enabling users to define their desired Knowledge Graph schema via prompts, granting control over what information to extract and how to structure it. To make fine-tuning a 7B-parameter model feasible on limited hardware, I adopted QLoRA for memory-efficient training on a single GPU.
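For illustration, a minimal sketch of what such a QLoRA setup can look like with Hugging Face transformers, peft, and bitsandbytes is shown below; the model variant, LoRA rank, and other hyperparameters are placeholder assumptions, not the project's exact configuration.

```python
# Minimal QLoRA setup sketch (illustrative; hyperparameters are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed base model variant

# 4-bit NF4 quantization keeps the frozen base weights small enough for a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters are the only trainable parameters; rank/alpha are illustrative values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```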
A comprehensive synthetic dataset was generated from six high-quality Neo4j graph databases, selected for their relevance and real-world applicability. This dataset was used to train and evaluate multiple AKGC approaches. The results were compelling: the fine-tuned model achieved 81% precision and 77% recall on familiar graph schemas, substantially outperforming the base model. More importantly, it demonstrated strong zero-shot generalization to entirely new schemas, showcasing the model’s ability to adapt beyond seen structures.
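As a rough sketch of how such triple-level precision and recall can be computed, the snippet below assumes exact matching on normalized (head, relation, tail) triples; the project's actual matching criterion may differ.

```python
# Triple-level precision/recall sketch (exact-match on normalized triples is an assumption).
def triple_metrics(predicted: set[tuple[str, str, str]],
                   gold: set[tuple[str, str, str]]) -> tuple[float, float]:
    """Return (precision, recall) for one document's extracted triples."""
    if not predicted or not gold:
        return 0.0, 0.0
    # Normalize casing/whitespace so trivially different surface forms still match.
    norm = lambda t: tuple(x.strip().lower() for x in t)
    pred, ref = {norm(t) for t in predicted}, {norm(t) for t in gold}
    correct = len(pred & ref)
    return correct / len(pred), correct / len(ref)

precision, recall = triple_metrics(
    {("Mistral 7B", "DEVELOPED_BY", "Mistral AI")},
    {("Mistral 7B", "DEVELOPED_BY", "Mistral AI"), ("Mistral 7B", "HAS_PARAMETERS", "7B")},
)
print(precision, recall)  # 1.0 0.5
```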
Rigorous experimentation revealed valuable insights. While zero-shot prompting performed well for single-document extraction, achieving consistency across multiple documents required few-shot examples to standardize entity formatting. This led to the development of an improved version optimized for few-shot learning, which surpassed both the base model and the initial fine-tuned version in multi-document scenarios.
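The sketch below shows one way a schema-constrained few-shot prompt could be assembled; the instruction wording, schema notation, and example triples are assumptions made for illustration.

```python
# Sketch of a schema-constrained few-shot prompt (wording and schema format are assumptions).
def build_prompt(schema: str, examples: list[tuple[str, str]], document: str) -> str:
    """Compose instruction + user-defined schema + few-shot demonstrations + target document."""
    parts = [
        "Extract a knowledge graph from the text as (head, relation, tail) triples.",
        "Only use node labels and relationship types from this schema:",
        schema,
    ]
    # Few-shot demonstrations anchor entity naming so it stays consistent across documents.
    for text, triples in examples:
        parts += [f"Text: {text}", f"Triples: {triples}"]
    parts += [f"Text: {document}", "Triples:"]
    return "\n\n".join(parts)

schema = "(:Person)-[:WORKS_AT]->(:Company), (:Company)-[:LOCATED_IN]->(:City)"
examples = [("Alice works at Acme in Berlin.",
             "(Alice, WORKS_AT, Acme); (Acme, LOCATED_IN, Berlin)")]
print(build_prompt(schema, examples, "Bob joined Initech, headquartered in Austin."))
```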
A particularly noteworthy discovery was that early stopping during fine-tuning, after just ~50 steps, resulted in better generalization to unseen schemas. This suggests that extended training may lead the model to overfit specific database structures, rather than learning transferable AKGC strategies.
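In practice, this finding can be operationalized by capping the number of optimizer steps in the training configuration; the sketch below mirrors the ~50-step observation, but the remaining hyperparameters, the data collator choice, and the `model`/`tokenizer`/`tokenized_train_dataset` objects (carried over from the earlier sketch) are assumptions.

```python
# Capping fine-tuning at roughly 50 optimizer steps (other hyperparameters are placeholders).
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="mistral-akgc-qlora",
    max_steps=50,                    # stop early: longer runs tended to overfit the seen schemas
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    logging_steps=10,
    save_steps=25,                   # keep checkpoints to compare generalization at each stage
    bf16=True,
)

trainer = Trainer(
    model=model,                                   # the PEFT-wrapped model from the QLoRA sketch
    args=training_args,
    train_dataset=tokenized_train_dataset,         # hypothetical pre-tokenized prompt/completion data
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```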
The final implementation provides a scalable and user-friendly alternative to traditional AKGC pipelines, enabling structured knowledge extraction with minimal resources. Beyond the technical contribution, the project surfaced key research directions, such as the need for more diverse training data and better fine-tuning heuristics.
This project reflects my ability to address complex NLP challenges, apply state-of-the-art machine learning techniques, and deliver practical tools that make advanced AI technology more accessible to a broader audience.

Tags

Fine-Tuning · PyTorch · Mistral 7B · Large Language Models · QLoRA · AKGC